Adding regex replacement feature #202
Conversation
Thanks for submitting this! I will have a closer look at this PR tomorrow.
Thanks for submitting this PR! I have left a few comments, but nothing major. Please ping me once ready for another round of review and we can get this merged as soon as possible :)
README.md (Outdated)
This will find words that glue two sentences together and will add a space to un-glue them, and it will split a long sentence into two smaller ones.
I think this is a good example and easily understandable, thanks for this thorough documentation. In the context of Wikipedia extracts, more sentences might actually mean less content, as a sentence might be fulfilling all rule requirements, but then gets split into two. And then only one of them gets picked. Of course this heavily depends on how many potential sentences a given article has. In many cases (such as yours), this might be beneficial, but it doesn't always have to be. Might be worth it to write a short explanation here for that as well.
I think this is where it is most worthwhile: if the article does not have enough sentences to select from (<3) because of the rules, especially max_words and/or max_characters. In that case, this algorithm can kick in and try to produce split sentences.
There is no way for us to know whether pre-split or post-split produces more "valuable" sentences, but many of the "sub-sentences" might be simple introductory wording etc.
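A toy illustration of the point above (the limit and sentences are invented, not the extractor's real config): with a word limit of 8, a 14-word sentence is rejected outright, but after splitting it at ", and " both halves pass:

```rust
fn main() {
    // Assumed limit for illustration; mirrors a max_words-style rule.
    let max_words = 8;
    let passes = |s: &str| s.split_whitespace().count() <= max_words;

    let long = "The museum opened in spring, and the exhibition ran until late autumn that year";
    // The full sentence (14 words) would be filtered out entirely.
    assert!(!passes(long));

    // After splitting, both halves (5 and 8 words) pass the rule,
    // so splitting can yield usable sentences where none existed before.
    let parts: Vec<&str> = long.split(", and ").collect();
    assert!(parts.iter().copied().all(|p| passes(p)));
}
```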
I agree, and that's exactly why I would prefer a short sentence explaining that, so people don't just blindly copy. If we have an indication that it works in all corpuses, then we could also just do it by default.
Added a note on sentence splitting.
```
@@ -28,6 +28,19 @@ pub fn replace_strings(rules: &Rules, raw: &str) -> String {
        }
    }

    // regex replacements
    for regex_replacement in rules.regex_replacement_list.iter() {
        if Value::as_array(regex_replacement).unwrap().len() == 3 {
```
This made me wonder if this implementation should go further than just with 3 values. Initially I thought such a regex implementation would only take two arguments and basically work like the replace_all function. But thinking about it, I can absolutely see why 3 arguments can be even more helpful, though many use cases could also be covered by named capture groups (but not all!).
Would you be interested in implementing a second form of this that accepts two arguments and replaces every matched occurrence with that string? This of course could be done outside this PR as a follow-up.
Co-authored-by: Michael Kohler <[email protected]>
Adding another replacement option that can process regexes. This can be used to split longer sentences into smaller chunks.
In my tests for Latvian, this can yield ~25% more sentences in the final output.